Windows Server 2008: Disaster Scenario Troubleshooting

11/19/2010 2:48:58 PM

This article details the high-level steps that can be taken to recover from particular types of disaster scenarios. Because this book and chapter focus on Windows Server 2008 R2 environments, so do the following sections.

Network Outage

When an organization is faced with a network outage, the impact can affect a small set of users, an entire office, or the entire company. When a network outage occurs, the network administrators should perform the following tasks:

Test the reported outage to verify if the issue is related to a wide area network (WAN) connection between the organization and the Internet service provider (ISP), the router, a network switch, a firewall, a physical fiber or copper network connection or network port, or line power to any of the aforementioned devices.
After the issue is isolated or, at least, the scope of the issue is understood, the network administrator should communicate the outage to the necessary managers and/or business owners and, as necessary, open communication with outside support vendors and ISP contacts to report the issue and create a trouble ticket. This communication should not be sent by email if the network is down.
Create a logical action plan to resolve the issue and execute the plan.
Create and distribute a summary of the cause and result of the issue and how it can be avoided in the future. Close the trouble ticket as required.
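The isolation step in the list above can be sketched in Python. This is a minimal illustration, not a prescribed tool: the hop labels, addresses (taken from documentation-only ranges), and ports are hypothetical, and a TCP connection attempt is only one of several ways to test reachability.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# (label, host, TCP port) for each device on the path, nearest first.
Hop = Tuple[str, str, int]


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def first_unreachable(
    hops: Sequence[Hop],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Optional[str]:
    """Walk the path outward and return the label of the first hop that fails.

    Returns None when every hop answers, which suggests the fault lies
    beyond the devices listed.
    """
    for label, host, port in hops:
        if not probe(host, port):
            return label
    return None


# Hypothetical path from an office LAN to the ISP (example values only):
# path = [
#     ("core switch", "192.0.2.2", 22),
#     ("firewall", "192.0.2.1", 443),
#     ("ISP router", "198.51.100.1", 80),
# ]
# print(first_unreachable(path))
```

Ordering the hops from nearest to farthest means the first failure reported is the closest point at which connectivity is lost, which helps establish the scope of the outage before it is communicated.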

Physical Site Failure

In the event a physical site or office cannot be accessed, a number of business operations might be suspended. Planning how to mitigate issues related to physical site limitations can be extensive, but should include the considerations discussed in the following sections.

Physical Site Access Is Limited but Site Is Functional

This section lists a few considerations for a situation where the site or office cannot be accessed physically, but all systems are functional:

Can the main and most critical phone lines be accessed or forwarded remotely?
Is there a remote access solution that allows employees, with or without notebook or laptop computers, to connect to the organization’s network and perform their work?
Are there any other business operations that require onsite access that are tied to a service-level agreement, such as responding to paper faxes or submitted customer support emails, phone calls, or custom applications?

Physical Site Is Offline and Inaccessible

This section lists a few considerations for a situation where the resources in a site are nonfunctional. This scenario assumes that the site resources cannot be accessed across the network or Internet and the data center is offline with no chance of a quick recovery. When planning for a scenario such as this, the following items should be considered:

Can all services be restored in an alternate capacity, or at least the most critical systems, such as the main phone lines, fax lines, devices, applications, systems, and remote access services?
If systems are cut over to an alternate location, what is the impact in performance, or what percentage of end-user load can the system support?
If systems are cut over to an alternate location, will there be any data loss or will only some data be accessible?
If the decision to cut over to the alternate location is made, how long will it take to cut over and restore the critical services?
If the site outage is caused by power loss or network issues, how long should the outage be tolerated before deciding to cut over services to an alternate location?
When the original system is restored, if possible, what will it take to fail back or cut the systems back to the main location, and will there be any data loss or synchronization of data involved?

These short lists merely scratch the surface when it comes to planning for or dealing with a physical site outage, but, hopefully, they will spark some dialogue in the disaster recovery planning process and lead the organization to the solution that meets its needs and budget.
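The questions about cutover time and how long an outage should be tolerated can be combined into a simple decision rule. The sketch below is one possible rule under stated assumptions, not a recommendation from this chapter: the function name, its parameters, and the two conditions are illustrative, and each organization would substitute its own thresholds.

```python
from datetime import timedelta


def should_cut_over(
    elapsed: timedelta,
    estimated_remaining: timedelta,
    cutover_time: timedelta,
    max_tolerable_outage: timedelta,
) -> bool:
    """Suggest a cutover to the alternate location (illustrative rule only).

    Assumed rule: cut over only when waiting for the primary site would
    exceed the tolerable outage AND the cutover itself would restore
    service sooner than waiting.
    """
    would_breach = elapsed + estimated_remaining > max_tolerable_outage
    cutover_is_faster = cutover_time < estimated_remaining
    return would_breach and cutover_is_faster


# One hour into an outage expected to last six more hours, with a two-hour
# cutover and a four-hour tolerance, the rule suggests cutting over.
decision = should_cut_over(
    elapsed=timedelta(hours=1),
    estimated_remaining=timedelta(hours=6),
    cutover_time=timedelta(hours=2),
    max_tolerable_outage=timedelta(hours=4),
)
```

A rule like this ignores the cost of failing back and any data loss at the alternate location, both of which the preceding questions raise and which would need to be weighed separately.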

Server or System Failure

When a server or system failure occurs, administrators must decide which recovery plan will be the most effective. Depending on the particular system, in some cases, it might be more efficient to build a new system and restore the functionality or data. In other cases, where rebuilding a system can take several hours, it might be more prudent to troubleshoot and repair the problem.

Application or Service Failure

If a Windows Server 2008 R2 system is still operational but a particular application or service on the system is nonfunctional, in most cases troubleshooting and attempting repair or restoring the system to a previous backup state is the correct plan of action. The Windows Server 2008 R2 event log is a much more useful tool than in previous versions, and it should be one of the first places an administrator looks to determine the cause of a validated issue. Following troubleshooting or recovery procedures for the particular application is the next logical step. For example, if an end user deleted a folder from a network share, the preferred recovery method might be to use Shadow Copy backups to restore the data instead of Windows Server Backup.
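As one way to make the event log the first stop, the following sketch builds a query for the built-in wevtutil command-line tool that returns recent critical, error, and warning events. It assumes Python 3.7 or later is installed on the server, which is not part of a default installation; the function names and default values are illustrative.

```python
import subprocess
from typing import List


def build_wevtutil_query(
    log_name: str = "System", hours: int = 24, max_events: int = 50
) -> List[str]:
    """Build a wevtutil command for recent critical, error, and warning events."""
    window_ms = hours * 60 * 60 * 1000
    # Level 1 = Critical, 2 = Error, 3 = Warning; timediff is in milliseconds.
    xpath = (
        "*[System[(Level=1 or Level=2 or Level=3) and "
        f"TimeCreated[timediff(@SystemTime) <= {window_ms}]]]"
    )
    return [
        "wevtutil", "qe", log_name,
        f"/q:{xpath}",        # XPath filter
        f"/c:{max_events}",   # maximum number of events to return
        "/rd:true",           # newest events first
        "/f:text",            # plain-text output
    ]


def recent_problems(log_name: str = "System", hours: int = 24) -> str:
    """Run the query on the local server and return the raw text output."""
    result = subprocess.run(
        build_wevtutil_query(log_name, hours),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# On the affected server (Windows only):
# print(recent_problems("Application", hours=24))
```

Passing the command as a list avoids shell quoting problems with the `<` character in the XPath filter.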

For Windows services, using Server Manager to review the status of the role and role services assists administrators in identifying and isolating problems because the Server Manager tool displays a filtered representation of Event Viewer items and service state for each role installed on the system. Figure 1 shows that the File Services role on SERVER10 logged several errors and warnings in the last 24 hours.

Figure 1. File Services role and role status.

Data Corruption or Loss

When a report has been logged that the data on a server is missing, is corrupted, or has been overwritten, Windows Server 2008 R2 administrators have a few options to deal with this situation. Shadow Copies for Shared Folders can be used to restore previous versions of selected files or folders, and Windows Server Backup can be used to restore selected files, folders, or the entire volume on a Windows disk. Using Shadow Copies for Shared Folders, administrators and end users with the correct permissions can restore data right from their workstation. Using the restore features of Windows Server Backup, administrators can place the restored data back into the same folder by overwriting the existing data or by placing a copy of the data with a different name based on the backup schedule date and time. For example, to restore a file called ClientProposal.docx that was backed up on 10-9-09 at 12:30 p.m., Windows Server Backup will restore the file as 2009-10-09 12-30 Copy of ClientProposal.docx, and the time representation will be the current time zone of the server.
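The naming pattern in that example can be reproduced in a few lines of Python, which may help when searching a folder for restored copies. The pattern here is inferred from the single example above; the function name is hypothetical, and the use of a 24-hour clock is an assumption, because 12:30 p.m. reads the same in both 12-hour and 24-hour notation.

```python
from datetime import datetime


def restored_copy_name(original_name: str, backup_time: datetime) -> str:
    """Return the name a restored copy would receive under the pattern
    'YYYY-MM-DD HH-MM Copy of <original name>' (24-hour clock assumed)."""
    return f"{backup_time:%Y-%m-%d %H-%M} Copy of {original_name}"


# Backup taken on October 9, 2009 at 12:30 p.m. server time.
name = restored_copy_name("ClientProposal.docx", datetime(2009, 10, 9, 12, 30))
# name == "2009-10-09 12-30 Copy of ClientProposal.docx"
```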

Hardware Failure

When hardware failure occurs, a number of issues and symptoms might result. The most common issues related to hardware failures include system crashes, services or drivers stopping unexpectedly, frozen (hung) systems, and systems that are in a constant reboot cycle. When hardware is suspected as failed or failing on a Windows Server 2008 R2 system, administrators should first review the event logs for any related system or application event warnings and errors. If nothing apparent is logged, hardware manufacturers usually provide several different diagnostic utilities that can be used to test and verify hardware configuration and functional state. Do not delay calling Microsoft and involving its professional support services department, because its engineers can work in conjunction with your team to capture and review debugging data.

When a system is suspected of having hardware issues and it is a business-critical system, steps should be taken to migrate services or applications hosted on that system to an alternate production system, or the system should be recovered to new hardware. Windows Server 2008 R2 can tolerate a full system restore or a complete PC restore to alternate hardware if the system is an exact or close hardware match with regard to the motherboard, processors, hard disk controller, and network card. Even if the hardware is an exact match, a complete PC restore to alternate hardware might fail if the disk arrays, disk IDs, and volume or partition numbers do not match and no additional steps are taken during the restore or recovery process.
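Before attempting a restore to alternate hardware, the four components named above can be compared between the failed server and the replacement. The sketch below is a simple checklist helper, assuming the inventory of each machine has already been recorded by hand or with a vendor tool; the dictionary keys, function name, and sample model strings are hypothetical. It flags any difference for review, which is stricter than the "close match" that a restore might tolerate.

```python
from typing import Dict, List

# Components the text identifies as needing an exact or close match.
CRITICAL_COMPONENTS = (
    "motherboard",
    "processor",
    "disk_controller",
    "network_card",
)


def components_to_review(source: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """Return the critical components whose recorded models differ.

    A missing entry on either side is also reported, because an unknown
    component cannot be confirmed as a match.
    """
    return [
        name
        for name in CRITICAL_COMPONENTS
        if source.get(name) is None or source.get(name) != target.get(name)
    ]


# Hypothetical inventories for the failed server and its replacement.
failed_server = {
    "motherboard": "Model A",
    "processor": "CPU X",
    "disk_controller": "RAID C1",
    "network_card": "NIC N1",
}
replacement = dict(failed_server, disk_controller="RAID C2")
differences = components_to_review(failed_server, replacement)
```

An empty result does not guarantee a successful restore, because the disk arrays, disk IDs, and volume or partition numbers must also line up.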